In this paper, a statistical analysis of the structure of one blog community, a kind of social
network, is presented. Quantities such as the degree distribution, clustering coefficient, and average
shortest path length are calculated to capture the features of the blogging network. We demonstrate
that the blogging network has the small-world property and that the in- and out-degree distributions
have power-law forms. The analysis also confirms that blogging networks show, in general, a
disassortative mixing pattern. Furthermore, the popularity of the blogs is found to follow Zipf's law,
namely, the distribution of the number of page views of blogs follows a power law.



Introduction 

The recent development of so-called network science reveals the underlying structures of
complex networks and has become the catalyst for a common interdisciplinary effort to tame
complexity. The small-world network model proposed by Watts and Strogatz showed
quantitatively that real networks are small worlds, which have high clustering and short
average path length. The six degrees of separation, uncovered by the social psychologist
Stanley Milgram, is the most famous manifestation of small-world theory. Real-world networks,
however, deviate significantly from the classic Erdős-Rényi model in that the degree
distribution is right-skewed, namely, it follows a power law rather than a Poisson
distribution [3]. In particular, for most networks, including the World Wide Web, the Internet
and metabolic networks, the degree distribution has a power-law tail, $p(k) \sim k^{-\gamma}$.
Such networks are called scale free, and the Barabási-Albert model (BA model) provides a
possible generating mechanism for such scale-free structure: growth and preferential
attachment [8]. These pioneering discoveries attracted a large number of scientists from
different backgrounds to this emerging, immature field. Besides, real networks are
hierarchical and have community structure, or are composed of elementary building blocks
called motifs [13]. Moreover, and surprisingly, complex networks are found to be self-similar,
corresponding to the ubiquitous geometric pattern seen in snowflakes. Meanwhile, the dynamics
taking place on complex networks, such as virus spreading (information propagation),
synchronization processes, games and cooperation, have been deeply investigated and are well
understood.

The word blog is short for the neologism "Web log", which is often a personal journal
maintained on the Web. In the past few years, blogs have been the fastest growing part of the
WWW [21]. There are now about 20 million blogs, which are emerging as an important
communication mechanism used by an increasing number of people [22]. The Web in its first
decade was like a big online library. Today, however, it has become more of a social web, not
unlike Berners-Lee's original vision. Consequently, advanced social technologies (blogs,
wikis, podcasting, RSS, etc.), which are characteristic of the Web 2.0 era, have changed the
ways people think and communicate. In the jargon of the blog field, blog space is referred to
as blogistan. As one surfs in blogistan, the global blogistan is just like an ecosystem,
called the blogosphere, that has a life of its own. From the viewpoint of complex adaptive
systems, the whole blogosphere is more than the sum of its weblogs. Therefore, one cannot
understand the blogosphere by studying a single weblog. Moreover, some interesting phenomena
corresponding to the classic ecological patterns (predators and prey, evolution and emergence,
natural selection and adaptation) are ubiquitous in the blogosphere, where evolutionary forces
play out in real time. For instance, individual weblogs vie for niche status, establish
communities of like-minded sites, and jostle for links to their sites. Besides, a fascinating
and powerful filtering effect, namely collaborative filtering, is created by the dynamic
hierarchy of links and recommendations generated by blogs. The more bloggers there are in a
particular community, the more efficient this filtering becomes, thus, counter-intuitively,
reducing information overload [22].

A typical blog is one long Web page on a content hosting site that provides blog space. It is
basically a large queue, with additions appearing at the top of the page and older material
scrolling down, often partitioned into archives and with links to other blogs within the same
host site (internal links) or to URLs elsewhere on the Web (external links). Sometimes,
personal blogs cite paragraphs of other blogs, often with embedded links that can be collected
by the blog hosting sites and returned as feedback to the original bloggers (the term
trackback is used in the blog community). At first glance, blogs are apparently nothing more
than common Web pages. Nevertheless, active blogs are updated with a frequency significantly
higher than that of a traditional Web page, often in a bursty manner. The number and quality
of links from a blog are quite different from those of ordinary Web pages. The links are
updated more frequently by the bloggers, and a significant fraction of the links point to
other blogs. Furthermore, the blogosphere creates instant online communities on diverse topics
for bloggers and readers, who can publish their comments on blogs. It is therefore more
interactive and open than common Web pages. In this sense, the blogosphere is worth
scrutinizing to reveal the underlying mechanisms of these interesting phenomena.

In this paper, we concentrate on a sub-ecosystem of the global blogosphere: the blogs hosted
by Sina, which is the largest Chinese blog space provider and has about 2 million registered
users in mainland China [29]. We are interested in the emerging link pattern between Sina
blogs, i.e., the collections of links to bloggers' favorite blog sites. For simplicity, links
pointing out of the domain (http://blog.sina.com.cn) are omitted. In addition, Zipf's law in
the popularity of the blogs is investigated.

The remainder of this paper is organized as follows. Sec. II deals with the method of data
collection, and Sec. III performs the statistical analysis of the structure of this
self-organized blogosphere, including the average degree, degree distribution, clustering
coefficient, etc. Finally, Sec. IV presents the conclusions and future work.



Data gathering 

Since there are around 20 million blogs, we focused on a sub-community of the global
blogosphere: the Chinese blogs hosted on Sina. We wanted to examine the structure of this
self-organized "ecosystem", including the emerging interconnection pattern of Sina blogs and
Zipf's law in the popularity of blogs. This would be a first step toward exploring the
mysteries of the vivid blogosphere.

The blog sites of Sina are very regular, and the entry to a blog has two equivalent forms:
(a) http://blog.sina.com.cn/rn/XXXX, where XXXX is a string consisting of letters and numbers;
(b) http://blog.sina.com.cn/u/xxxxx, where xxxxx is a 10-digit number serving as the user's
id. All users have the (b) form of entry to their blogs, while advanced users have both the
(a) and (b) forms. Because some bloggers' sites have both forms, the mapping between (a) and
(b) was established to avoid duplicate results. We designed a simple WWW robot which began
with the most popular blog, ranked first in the global blogosphere by Technorati [20]. This
popular blog has received more than 30 million page views. Following the collections of links
to favorite Sina blogs, the robot crawled a connected network of 200,399 nodes using
breadth-first search. At the same time, the page views of each visited blog were recorded.
Based upon these data, the analysis of the structure of the self-organized blogosphere is
carried out in the next section.

FIG. 1: (Color online) Cumulative distribution of in-degree of blogs. The straight red line
is the linear fit of the data, whose slope is -1.34 ± 0.001.
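The crawling procedure described in this section can be sketched as follows. This is only a minimal illustration, not the robot actually used: `get_links` and `canonical` are hypothetical callables (a real crawler would fetch and parse the favorite-links list of each page, and would map an (a)-form entry to its (b)-form user id), and the toy link table stands in for the Web.

```python
from collections import deque


def bfs_crawl(seed, get_links, canonical=lambda blog_id: blog_id):
    """Breadth-first crawl of a directed link graph starting from `seed`.

    get_links(blog_id) -> iterable of linked blog ids (hypothetical helper).
    canonical(blog_id) -> one unique id per blog, so that two entry forms
    of the same blog are not counted twice (hypothetical helper).
    Returns the set of visited blogs and the set of directed edges.
    """
    start = canonical(seed)
    visited = {start}
    edges = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for raw_target in get_links(current):
            target = canonical(raw_target)
            if target == current:
                continue  # ignore self-links
            edges.add((current, target))
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return visited, edges


# Toy data standing in for real pages (assumed, for illustration only).
toy_links = {"A": ["B", "C"], "B": ["A"], "C": ["D", "alias_of_A"], "D": []}
toy_aliases = {"alias_of_A": "A"}

nodes, links = bfs_crawl(
    "A",
    lambda blog: toy_links.get(blog, []),
    lambda blog: toy_aliases.get(blog, blog),
)
```

Because every node other than the seed is reached by following a link, each crawled node except possibly the seed has in-degree of at least 1.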



Results 

The emerging link pattern of Sina blogs is mined from the crawled network. Since the network
is directed, the connectivity of a blog consists of incoming and outgoing connections, namely
$k_{in}$ and $k_{out}$, respectively. The in-degree can be used as an index of the importance
of a blog. From Fig. 1 we found that the cumulative distribution of in-degree obeys a
power-law form, $P_{in}(k > K) \sim K^{-\alpha}$, where $\alpha = 1.34 \pm 0.001$. Therefore,
the in-degree distribution, which gives the probability that a randomly chosen node i has k
incoming connections, follows a power law, $P_{in}(k) \sim k^{-\gamma_{in}}$, where
$\gamma_{in} = \alpha + 1 = 2.34 \pm 0.001$ [31]. The cumulative distribution of out-degree has
a power-law tail, $P_{out}(k > K) \sim K^{-\beta}$, where $\beta = 2.60 \pm 0.02$ (see Fig. 2
for details). Thereby, the out-degree distribution has the form
$P_{out}(k) \sim k^{-\gamma_{out}}$, where $\gamma_{out} = \beta + 1 = 3.60 \pm 0.02$. By
contrast with the in-degree, the out-degree distribution deviates slightly from the
right-skewed heavy tail for small out-degree $k_{out}$. One may argue that a log-normal
distribution would fit the data better than a power law. Yet we think that there exists a
threshold at a certain $k_{out}$, and that a power-law tail exists once the out-degree exceeds
that threshold, given the evidence that most of the data fall into the right-skewed tail [32].
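As a hedged illustration of how such exponents can be estimated, the sketch below builds an empirical cumulative distribution and fits a straight line on log-log axes by ordinary least squares. The function names are ours, and two conventions are assumptions rather than the procedure used for the figures: the cumulative fraction is taken as $P(k \ge K)$ so that every point stays positive on logarithmic axes, and non-positive points are simply skipped.

```python
import math
from collections import Counter


def ccdf(values):
    """Return the sorted distinct values K and the fraction of samples >= K."""
    counts = Counter(values)
    total = len(values)
    ks = sorted(counts)
    fractions = []
    remaining = total
    for k in ks:
        fractions.append(remaining / total)
        remaining -= counts[k]
    return ks, fractions


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x); skips non-positive points."""
    pairs = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pairs)
    mean_x = sum(px for px, _ in pairs) / n
    mean_y = sum(py for _, py in pairs) / n
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pairs)
    sxx = sum((px - mean_x) ** 2 for px, _ in pairs)
    return sxy / sxx


# A cumulative slope of -alpha corresponds to a density exponent of
# gamma = alpha + 1, because differentiating K**(-alpha) gives K**(-alpha - 1).
ks, fractions = ccdf([1, 1, 2, 3])        # ks = [1, 2, 3], fractions = [1.0, 0.5, 0.25]
alpha = -loglog_slope(ks, fractions)
gamma = alpha + 1
```

When only the tail is expected to follow a power law, as argued for the out-degree, the fit would be restricted to values above the chosen threshold before calling `loglog_slope`.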

In our collected population of inter-connected blogs, the maximum in-degree is 13,342,
whereas a majority of blogs have just a few incoming links (see Tab. I). The power-law
distribution of in-degree indicates that many common bloggers preferentially add links to
their favorite celebrities' blogs, and such preferential behavior results in the power-law
distribution of the in-degree, just as the BA model describes. We found that a significant
fraction of the blogs, namely 32.6%, have no outgoing links to other blogs. Further, a
considerable fraction of blogs have only a few outgoing connections (see Tab. I). That is to
say, most of the bloggers are unknown to the public and they are not active enough in the
blogosphere (they have few or no outgoing links).

FIG. 2: (Color online) Cumulative distribution of out-degree of blogs. The plotted straight
red line is of slope -2.60 ± 0.02 for comparison. The inset shows the detailed log-log plot of
the right-skewed tail for large out-degree. The linear fit of those data supports that the
heavy tail obeys a power law, $P_{out}(k > K) \sim K^{-\beta}$.

TABLE I: Percentage of blogs with in- and out-degrees of 0, 1, 2 and 3. Note that a large
fraction of blogs have only small in- and out-degrees. Since our blogging network was crawled
along the directed links, the in-degree of blogs is at least 1.

Degree    0        1        2        3
In        -        48.4%    18.1%    9.8%
Out       32.6%    14.2%    9.7%     7.5%

The average degree $\langle k \rangle$ of this blogging network is 9.0; that is to say, each
node in this social network has on average 9 neighbors. Furthermore, the average in- and
out-degrees are $\langle k_{in} \rangle = \langle k_{out} \rangle = 4.5$. Although there are
millions of connections in the social network, only about 28.7% of them are symmetric, and
most of the symmetric links are between the blogs of bloggers who became acquainted with each
other in the blogosphere. This indicates that the blogging network is asymmetric: while a
node tends to link to a famous node, it is seldom the case that the famous node links back to
it.
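The quantities in this paragraph can be computed from a list of directed edges as in the following sketch. The function name is ours, and the definition of the symmetric fraction (the share of directed links whose reverse link is also present) is an assumption, since the exact counting convention is not spelled out in the text.

```python
def degree_summary(edges):
    """Mean degrees and fraction of reciprocated links of a directed graph.

    `edges` is an iterable of (source, target) pairs; duplicates are ignored.
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    nodes = {node for edge in edge_set for node in edge}
    n_nodes = len(nodes)
    n_edges = len(edge_set)
    # Every edge adds one outgoing and one incoming connection, so the mean
    # in-degree always equals the mean out-degree, and the mean total degree
    # is twice that value.
    mean_in = n_edges / n_nodes
    reciprocated = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return {
        "mean_in": mean_in,
        "mean_out": mean_in,
        "mean_total": 2 * mean_in,
        "symmetric_fraction": reciprocated / n_edges,
    }


# Toy example (assumed data): A and B link to each other, the rest are one-way.
summary = degree_summary([("A", "B"), ("B", "A"), ("A", "C"), ("C", "D")])
```

The identity in the comment is why the reported values satisfy $\langle k_{in} \rangle = \langle k_{out} \rangle = \langle k \rangle / 2$.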



TABLE II: Correlation coefficients for the degrees at either side of an edge. Negative values
indicate that poorly connected nodes tend to link to highly connected nodes, while positive
values suggest that nodes with similar connectivity are likely to connect to each other.
Table II displays the correlation coefficients of different types of degree-degree
correlations for the crawled blogging network. Correlations are measured by Pearson's
correlation coefficient r for the degrees at either side of an edge, as suggested by Mark
Newman [24]:

$$ r = \frac{\langle k_{to} k_{from} \rangle - \langle k_{to} \rangle \langle k_{from} \rangle}{\sqrt{\langle k_{to}^2 \rangle - \langle k_{to} \rangle^2} \, \sqrt{\langle k_{from}^2 \rangle - \langle k_{from} \rangle^2}}, $$

where $k_{to}$ and $k_{from}$ can be any of the four possible combinations of in- and
out-degrees at the two ends of an edge.
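A minimal sketch of this coefficient is given below, with averages taken over all directed edges. The function and parameter names are ours; in particular, which end of the edge each subscript of the coefficients in Table II refers to is not stated explicitly, so `from_kind` and `to_kind` are labeled here by the end of the edge they apply to.

```python
import math
from collections import Counter


def degree_correlation(edges, from_kind="out", to_kind="in"):
    """Pearson correlation between a degree of the source node and a degree
    of the target node, averaged over all directed edges.

    from_kind / to_kind select "in" or "out" degree at each end of the edge.
    Returns NaN when either degree sequence has zero variance.
    """
    edge_list = list({(u, v) for u, v in edges})
    degree = {
        "out": Counter(u for u, _ in edge_list),
        "in": Counter(v for _, v in edge_list),
    }
    xs = [degree[from_kind][u] for u, _ in edge_list]
    ys = [degree[to_kind][v] for _, v in edge_list]
    n = len(edge_list)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    if var_x <= 0 or var_y <= 0:
        return float("nan")
    return covariance / math.sqrt(var_x * var_y)


# Toy example (assumed data): three nodes pointing at one well-linked node.
toy_edges = [(1, 2), (1, 3), (2, 3), (4, 3)]
r_toy = degree_correlation(toy_edges, from_kind="out", to_kind="in")
```

Calling the function with the four combinations of `"in"` and `"out"` yields the four types of degree-degree correlation; a negative value signals disassortative mixing for that combination.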

Networks with an assortative mixing pattern are those in which nodes with large degree tend
to be connected to other nodes with many connections, and vice versa. Technical and biological
networks are in general disassortative, while social networks are often assortatively mixed,
as demonstrated by the study of scientific collaboration networks [24]. The blogging network,
however, presents a disassortative mixing pattern when directions are not considered. Positive
mixing is shown for $r_{in-out}$ and $r_{out-out}$ in our case. Positive $r_{in-out}$ means
that active bloggers in the community (with large $k_{out}$) tend to associate with those who
succeed in promoting themselves in the community (with high $k_{in}$), while a large
$r_{out-out}$ suggests that the active bloggers preferentially link to each other. The
Internet dating community, a kind of social network embedded in a technical one, and
peer-to-peer (P2P) social networks are similar to our case, displaying a significant
disassortative mixing pattern [25, 26].

The average shortest path length $\langle l \rangle$ is calculated as the mean geodesic
distance between all pairs of nodes that have at least one path connecting them. In this case,
$\langle l \rangle = 6.84$. That means that, on average, one needs only about 7 clicks to go
from one blog site to any other blog site in the blogosphere. The diameter D of this social
network, which is defined as the maximum of the shortest path lengths, is 27. Because the
blogging network is directed, the clustering coefficient is not straightforward to compute.
One way to avoid this difficulty is to make the network undirected. First, the one-way
connections were removed from the network; second, the isolated nodes were deleted from the
graph. By doing so, a bidirectional graph with 122,470 nodes was obtained, on which the
clustering coefficient was computed. The mean degree of this undirected network,
$\langle k_{undirected} \rangle$, is 3.28. The clustering coefficient of node i in an
undirected network is defined as $C_i = \frac{2 E_i}{k_i (k_i - 1)}$, that is, the ratio
between the number $E_i$ of edges that actually exist between the $k_i$ neighbor nodes of
node i and the total possible number $k_i (k_i - 1)/2$. The clustering coefficient of the
whole network is the average of all individual $C_i$'s. We found the clustering coefficient
C = 0.1490, orders of magnitude higher than that of a corresponding random graph of the same
size, $C_{rand} = 3.28/122470 = 0.0000268$. Besides, the degree-dependent local clustering
coefficient C(k) is obtained by averaging $C_i$ over vertices of degree k. Fig. 3 plots the
cumulative distribution of C(k) for the undirected blogging network. However, it is hard to
declare a clear power law in our case. Nevertheless, the non-flat clustering coefficient
distribution shown in the figure suggests that the dependency of C on k is nontrivial, and
thus points to some degree of hierarchy in the network. Taken together, the average shortest
path length is far smaller than the logarithm of the network size in this blogging network,
and the network has a relatively high clustering coefficient. Hence, the blogging network of
inter-connected blogs exhibits the small-world effect. This small-world phenomenon is also
consistent with the earlier small-world discovery about the WWW.

FIG. 3: (Color online) Cumulative distribution of clustering coefficient of blogs.

To evaluate the popularity of the blogs, the cumulative distribution of the number of page
views of blog sites is computed (see Fig. 4). For small page views S (S < 500), there is
saturation. However, for large S, the fraction of blogs that have more than S page views in
total obeys a power-law form, $P(s > S) \sim S^{-\tau}$, where
$\tau = 0.87 \pm 6.56 \times 10^{-4}$. It follows immediately that the distribution of page
views of blogs is $p(s) \sim s^{-\mu}$, where $\mu = \tau + 1 = 1.87$. In our case of 200339
nodes, only ten blogs have page views exceeding tens of millions, while most of the rest have
only tens of thousands of page views. This heavy-tailed distribution indicates that most
readers are attracted by the celebrated bloggers and contribute page views to their blogs,
whereas only a minority of the grassroots blogs gain public attention in the blogosphere. In
this sense, some kind of inequality develops: the rich get richer while the poor get poorer.
Thus, social technologies not only enhance communication between distant people, but also
facilitate inequality between the celebrated and the common. In this respect, the blogosphere
might be a good paradigm for studying the emergence of such inequality.

FIG. 4: (Color online) Cumulative distribution of the fraction of blogs of which the number
of page views is more than S. The straight line is of slope -0.87 for comparison with the
distribution.
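As a rough illustration of the clustering computation described in this section, the sketch below keeps only the reciprocated links, treats them as undirected edges, and averages the local clustering coefficient over the remaining nodes. The function names are ours, and assigning a coefficient of 0 to nodes with fewer than two neighbors is an assumed convention that the text does not specify.

```python
from collections import defaultdict
from itertools import combinations


def mutual_graph(edges):
    """Drop one-way links; return an undirected adjacency map of mutual links.

    Nodes left without any mutual link do not appear in the result, which
    corresponds to deleting isolated nodes.
    """
    edge_set = set(edges)
    adjacency = defaultdict(set)
    for u, v in edge_set:
        if u != v and (v, u) in edge_set:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def local_clustering(adjacency, node):
    """C_i = 2 E_i / (k_i (k_i - 1)); taken as 0 when k_i < 2 (assumption)."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Toy example (assumed data): a mutual triangle A-B-C, a mutual pair A-D,
# and a one-way link E -> A that is discarded.
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "C"),
             ("C", "A"), ("A", "D"), ("D", "A"), ("E", "A")]
toy_graph = mutual_graph(toy_edges)
toy_c = average_clustering(toy_graph)

# Random-graph baseline quoted in the text: mean degree divided by node count.
c_rand = 3.28 / 122470
```

Comparing the measured average with `c_rand` is the check behind the statement that the clustering is orders of magnitude higher than in a random graph of the same size.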



Concluding remarks and future work

In summary, a sub-ecosystem of the global blogosphere is scrutinized to reveal the underlying
link pattern and the popularity of the blogs. We found that the blogging community has the
small-world property. In addition, the in-degree and out-degree distributions follow power-law
forms. Calculations of degree-degree correlations show that blogging networks are in general
disassortatively mixed, except that active bloggers are connected to each other and to those
with high in-degree. The distribution of the number of page views of blogs also obeys a power
law. Although our crawled blogging network is static whereas the blogosphere is by nature
dynamical and evolving, our observations and statistical analysis might be a first step toward
understanding such an ecosystem. However, what has been done is not enough. There are still
various aspects of the blogosphere to be investigated. Recently, a new technique called
collaborative tagging has gained ground in the blogging community because it can help bloggers
effectively share tremendous amounts of information and find useful information [33]. It is of
some merit to study tag co-occurrence to reveal the universal characteristics of users'
tagging behavior [27]. Moreover, the fascinating phenomenon of emerging hot discussion topics
is worth examining to uncover the intrinsic features of collective behavior in the
blogosphere. In addition, the recommendation rules of blogging create powerful collaborative
filtering; thus the blogosphere would be a suitable system in which to study the collaborative
filtering effect. Additionally, detecting the latent community structures in the blogosphere
would be meaningful. It is also interesting to study the propagation of information, such as
rumors, in this ecosystem. In short, the self-organized blogosphere is a good paradigm for
understanding various facets of the behavior patterns of bloggers in such an ecosystem.